:orphan:

Sklearn Basics 4: Train a Coclustering
======================================

The steps to train a coclustering model with Khiops are very similar to
what we have already seen in the basic classifier tutorials.

We start by importing the sklearn estimator ``KhiopsCoclustering`` and
the modules used in this tutorial:

.. code:: ipython3

    import os
    import platform
    import subprocess

    import pandas as pd

    from khiops import core as kh
    from khiops.sklearn import KhiopsCoclustering

    # If there are any issues, you may check the Khiops status with the following command
    # kh.get_runner().print_status()

For this tutorial, we use the dataset ``CountriesByOrganization``, which
contains the country-organization relation for a large number of
countries and organizations (*it is a bit outdated though*).

The objective is to build a coclustering between ``Country`` and
``Organization`` and see which countries are most similar in terms of
the organizations they belong to.

Let’s first load this dataset and check its content:

.. code:: ipython3

    countries_data_file = os.path.join(
        "data", "CountriesByOrganization", "CountriesByOrganization.csv"
    )
    X_countries = pd.read_csv(countries_data_file, sep=";")
    print("CountriesByOrganization dataset:")
    display(X_countries)

.. parsed-literal::

    CountriesByOrganization dataset:

.. parsed-literal::

               Country  Organization
    0      Afghanistan          AsDB
    1      Afghanistan       COLOMBO
    2      Afghanistan           ECO
    3      Afghanistan        ICCROM
    4      Afghanistan           NAM
    ...            ...           ...
    11187     Zimbabwe           WHO
    11188     Zimbabwe          WIPO
    11189     Zimbabwe           WMO
    11190     Zimbabwe           WTO
    11191     Zimbabwe        WTOURO

    [11192 rows x 2 columns]

Now, let’s build the coclustering model. Note that a coclustering model
is learned in an unsupervised way and aims to jointly cluster the rows
and columns of a matrix. To deploy it, we therefore need to specify the
column whose values will be assigned to clusters. We do this by setting
the ``fit`` parameter ``id_column``:

.. code:: ipython3

    khcc_countries = KhiopsCoclustering()
    khcc_countries.fit(X_countries, id_column="Country")

.. parsed-literal::

    KhiopsCoclustering()

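To make the idea of jointly clustering rows and columns more concrete,
the sketch below builds the country-by-organization contingency matrix
for a handful of pairs. The toy rows are hypothetical and only mimic
the two-column layout of ``CountriesByOrganization``; the sketch uses
plain pandas and does not reproduce what Khiops does internally.

.. code:: ipython3

    import pandas as pd

    # Hypothetical toy sample in the same (Country, Organization) layout
    toy_pairs = pd.DataFrame(
        {
            "Country": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe", "Zimbabwe"],
            "Organization": ["AsDB", "NAM", "WHO", "WTO", "NAM"],
        }
    )

    # Rows are countries, columns are organizations, cells count co-occurrences
    contingency = pd.crosstab(toy_pairs["Country"], toy_pairs["Organization"])
    print(contingency)

A coclustering groups the rows and the columns of such a matrix at the
same time, so that countries with similar membership profiles end up in
the same cluster.
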
Now let’s access the coclustering training report to obtain the cluster
information of the ``Country`` dimension. Since the clusters of each
dimension form a hierarchy, we only access the leaf clusters:

.. code:: ipython3

    countries_clusters = khcc_countries.model_report_.coclustering_report.get_dimension(
        "Country"
    ).clusters
    countries_leaf_clusters = [cluster for cluster in countries_clusters if cluster.is_leaf]
    print(f"Number of leaf clusters: {len(countries_leaf_clusters)}:")
    for index, cluster in enumerate(countries_leaf_clusters, start=1):
        print(f"cluster {index:02d}: {cluster.name}")

.. parsed-literal::

    Number of leaf clusters: 12:
    cluster 01: {Germany, France, Netherlands, ...}
    cluster 02: {United States of America, Canada, Japan, ...}
    cluster 03: {Poland, Hungary, Turkey, ...}
    cluster 04: {Kazakhstan, Kyrgyzstan, Azerbaijan, ...}
    cluster 05: {Venezuela, Nicaragua, Ecuador, ...}
    cluster 06: {Trinidad and Tobago, Barbados, Grenada, ...}
    cluster 07: {Niger, Ivory Coast, Benin, ...}
    cluster 08: {Tanzania, Uganda, Kenya, ...}
    cluster 09: {Qatar, Saudi Arabia, United Arab Emirates, ...}
    cluster 10: {Tunisia, Algeria, Morocco, ...}
    cluster 11: {India, Malaysia, Indonesia, ...}
    cluster 12: {Papua New Guinea, Fiji, Nepal, ...}

The composition of the clusters is also available. For the first one we
have:

.. code:: ipython3

    print(f"Members of the cluster {countries_leaf_clusters[0].name}:")
    # Iterate over the same leaf cluster whose name is printed above
    for value_obj in countries_leaf_clusters[0].leaf_part.values:
        print(value_obj.value)

.. parsed-literal::

    Members of the cluster {Germany, France, Netherlands, ...}:
    Germany
    France
    Netherlands
    Denmark
    Sweden
    Belgium
    Finland
    Italy
    Norway
    Spain
    Portugal
    Austria
    United Kingdom
    Luxembourg
    Switzerland
    Greece
    Ireland
    Iceland

A coclustering is a complex model, so it is better to visualize it with
the Khiops Co-visualization app. Let’s export the report to a ``.khcj``
file and open it:

.. code:: ipython3

    countries_report = os.path.join("exercises", "countries.khcj")
    khcc_countries.export_report_file(countries_report)
    # explorer_open(countries_report)

Finally, let’s deploy the coclustering model on the training data
``X_countries``:

.. code:: ipython3

    countries_predictions = khcc_countries.predict(X_countries)
    print("Predicted clusters (first 10)")
    display(countries_predictions[:10])

.. parsed-literal::

    Predicted clusters (first 10)

.. parsed-literal::

    array([['Afghanistan', '{India, Malaysia, Indonesia, ...}'],
           ['Albania', '{Poland, Hungary, Turkey, ...}'],
           ['Algeria', '{Tunisia, Algeria, Morocco, ...}'],
           ['Andorra', '{Poland, Hungary, Turkey, ...}'],
           ['Angola', '{Niger, Ivory Coast, Benin, ...}'],
           ['Antigua and Barbuda',
            '{Trinidad and Tobago, Barbados, Grenada, ...}'],
           ['Argentina', '{Venezuela, Nicaragua, Ecuador, ...}'],
           ['Armenia', '{Kazakhstan, Kyrgyzstan, Azerbaijan, ...}'],
           ['Australia', '{United States of America, Canada, Japan, ...}'],
           ['Austria', '{Germany, France, Netherlands, ...}']], dtype=object)

Exercise
~~~~~~~~

We’ll build a coclustering model for the ``Tokyo2021`` dataset. It is
extracted from the ``Athletes`` table of the Tokyo 2021 Kaggle dataset,
and each record contains three variables:

- ``Name``: the name of a competing athlete
- ``Country``: the country (or organization) the athlete represents
- ``Discipline``: the athlete’s discipline

The objective of this exercise is to build a coclustering between
``Country`` and ``Discipline`` and see which countries are most similar
in terms of the athletes they bring to the Olympics.

We start by loading the contents into a dataframe:

.. code:: ipython3

    tokyo_data_file = os.path.join("data", "Tokyo2021", "Athletes.csv")
    X_tokyo = pd.read_csv(tokyo_data_file, encoding="latin1")
    print("Tokyo2021 dataset (first 10 rows):")
    display(X_tokyo.head(10))

.. parsed-literal::

    Tokyo2021 dataset (first 10 rows):

.. parsed-literal::

                    Name                   Country           Discipline
    0    AALERUD Katrine                    Norway         Cycling Road
    1        ABAD Nestor                     Spain  Artistic Gymnastics
    2  ABAGNALE Giovanni                     Italy               Rowing
    3     ABALDE Alberto                     Spain           Basketball
    4      ABALDE Tamara                     Spain           Basketball
    5          ABALO Luc                    France             Handball
    6       ABAROA Cesar                     Chile               Rowing
    7      ABASS Abobakr                     Sudan             Swimming
    8   ABBASALI Hamideh  Islamic Republic of Iran               Karate
    9      ABBASOV Islam                Azerbaijan            Wrestling

Train the coclustering for the variables ``Country`` and ``Discipline``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Call ``fit`` with the following parameters:

- ``X=X_tokyo[["Country", "Discipline"]]``
- ``id_column="Country"``

.. code:: ipython3

    khcc_tokyo = KhiopsCoclustering()
    khcc_tokyo.fit(X_tokyo[["Country", "Discipline"]], id_column="Country")

.. parsed-literal::

    KhiopsCoclustering()

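The column selection passed to ``fit`` drops ``Name``, so each athlete
contributes one ``(Country, Discipline)`` pair and repeated pairs
remain in the data. As a rough illustration, assuming only pandas and
using the first six rows displayed above rather than the full dataset,
the pairs can be counted as follows:

.. code:: ipython3

    import pandas as pd

    # First six rows of the Athletes table as displayed above (not the full dataset)
    sample = pd.DataFrame(
        {
            "Name": [
                "AALERUD Katrine",
                "ABAD Nestor",
                "ABAGNALE Giovanni",
                "ABALDE Alberto",
                "ABALDE Tamara",
                "ABALO Luc",
            ],
            "Country": ["Norway", "Spain", "Italy", "Spain", "Spain", "France"],
            "Discipline": [
                "Cycling Road",
                "Artistic Gymnastics",
                "Rowing",
                "Basketball",
                "Basketball",
                "Handball",
            ],
        }
    )

    # Same column selection as in the call to fit
    pairs = sample[["Country", "Discipline"]]

    # Number of athletes per (Country, Discipline) pair
    pair_counts = pairs.groupby(["Country", "Discipline"]).size()
    print(pair_counts)

In this sample the pair ``(Spain, Basketball)`` appears twice, once per
athlete.
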
Obtain the number and names of the clusters of the ``Country`` dimension
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    tokyo_clusters = khcc_tokyo.model_report_.coclustering_report.get_dimension(
        "Country"
    ).clusters
    tokyo_leaf_clusters = [cluster for cluster in tokyo_clusters if cluster.is_leaf]
    print(f"Number of leaf clusters: {len(tokyo_leaf_clusters)}:")
    for index, cluster in enumerate(tokyo_leaf_clusters):
        print(f"cluster {index:02d}: {cluster.name}")

.. parsed-literal::

    Number of leaf clusters: 39:
    cluster 00: {Ghana, Kosovo, Republic of Moldova, ...}
    cluster 01: {Jamaica, Ethiopia, Trinidad and Tobago, ...}
    cluster 02: {Kenya, Fiji}
    cluster 03: {Uzbekistan, Azerbaijan, Mongolia, ...}
    cluster 04: {Serbia, Islamic Republic of Iran}
    cluster 05: {Turkey, Tunisia, Venezuela, ...}
    cluster 06: {Chinese Taipei, Thailand, Indonesia, ...}
    cluster 07: {Switzerland, Austria, Hong Kong, China, ...}
    cluster 08: {Colombia, Morocco, Ecuador, ...}
    cluster 09: {Ukraine, Belarus, Slovakia}
    cluster 10: {Kazakhstan, Croatia, Greece}
    cluster 11: {Japan}
    cluster 12: {Argentina}
    cluster 13: {Republic of Korea}
    cluster 14: {Egypt}
    cluster 15: {Israel, Dominican Republic}
    cluster 16: {Mexico}
    cluster 17: {Zambia, Saudi Arabia, Honduras, ...}
    cluster 18: {Romania}
    cluster 19: {Great Britain, Ireland}
    cluster 20: {New Zealand}
    cluster 21: {Australia}
    cluster 22: {Canada}
    cluster 23: {People's Republic of China}
    cluster 24: {United States of America}
    cluster 25: {Italy}
    cluster 26: {Poland, Lithuania}
    cluster 27: {Germany, Belgium}
    cluster 28: {India}
    cluster 29: {Czech Republic, Nigeria, Slovenia, ...}
    cluster 30: {Spain}
    cluster 31: {South Africa}
    cluster 32: {Netherlands}
    cluster 33: {ROC}
    cluster 34: {Hungary, Montenegro}
    cluster 35: {Norway, Denmark, Portugal}
    cluster 36: {Sweden, Angola, Bahrain}
    cluster 37: {Brazil}
    cluster 38: {France}

Print the members of one of the clusters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    print(f"Members of the cluster {tokyo_leaf_clusters[29].name}:")
    for value_obj in tokyo_leaf_clusters[29].leaf_part.values:
        print(value_obj.value)

.. parsed-literal::

    Members of the cluster {Czech Republic, Nigeria, Slovenia, ...}:
    Czech Republic
    Nigeria
    Slovenia
    Puerto Rico

Check the results with the covisualization app
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    tokyo_report = os.path.join("exercises", "tokyo.khcj")
    khcc_tokyo.export_report_file(tokyo_report)

    # To visualize uncomment the lines below
    # khcc_tokyo.export_report_file("./tokyo_report.khcj")
    # kh.export_report_file("./tokyo_report.khcj")

Deploy the learned coclustering model on the training data and check the obtained clusters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    tokyo_predictions = khcc_tokyo.predict(X_tokyo[["Country", "Discipline"]])
    print("Predicted clusters (first 10)")
    display(tokyo_predictions[:10])

.. parsed-literal::

    Predicted clusters (first 10)

.. parsed-literal::

    array([['Norway', '{Norway, Denmark, Portugal}'],
           ['Spain', '{Spain}'],
           ['Italy', '{Italy}'],
           ['France', '{France}'],
           ['Chile', '{Zambia, Saudi Arabia, Honduras, ...}'],
           ['Sudan', '{Ghana, Kosovo, Republic of Moldova, ...}'],
           ['Islamic Republic of Iran', '{Serbia, Islamic Republic of Iran}'],
           ['Azerbaijan', '{Uzbekistan, Azerbaijan, Mongolia, ...}'],
           ['Netherlands', '{Netherlands}'],
           ['Australia', '{Australia}']], dtype=object)
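
The prediction output is a two-column array holding the identifier and
its cluster label, which is easier to query once wrapped in a
dataframe. The sketch below is one possible way to do it, using a few
rows copied from the output above instead of a trained model; the
column names ``Country`` and ``Cluster`` are labels chosen here for
illustration, since the array itself carries no column names.

.. code:: ipython3

    import numpy as np
    import pandas as pd

    # A few rows copied from the prediction output shown above
    sample_predictions = np.array(
        [
            ["Norway", "{Norway, Denmark, Portugal}"],
            ["Spain", "{Spain}"],
            ["Italy", "{Italy}"],
            ["France", "{France}"],
            ["Chile", "{Zambia, Saudi Arabia, Honduras, ...}"],
        ],
        dtype=object,
    )

    # Column names are assumptions made for this example
    predictions_df = pd.DataFrame(sample_predictions, columns=["Country", "Cluster"])

    # Look up the cluster assigned to a given country
    cluster_of = predictions_df.set_index("Country")["Cluster"]
    print(cluster_of["Chile"])

The same wrapping should apply to ``tokyo_predictions`` or
``countries_predictions``, provided they keep the two-column layout
displayed above.
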